Pruned universal symbol sequences for LZW based language identification
Abstract
We present an improved language modeling technique for the Lempel-Ziv-Welch (LZW) based language identification (LID) scheme. The previous approach to LID using LZW builds each language pattern table with the LZW algorithm itself. Because LZW processes its input sequentially, several language-specific patterns were missing from the pattern table. To overcome this, we build a universal pattern table that contains all patterns of different lengths. For each language, the corresponding language-specific pattern table is constructed by retaining the patterns of the universal table whose frequency of appearance in the training data is above the threshold. This approach reduces the classification score (compression ratio [LZW-CR] or weighted discriminant score [LZW-WDS]) for non-native languages and increases LID performance considerably.
Index Terms: Language modeling, PRLM, Pattern table, LZW-CR, LZW-WDS.
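The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the maximum pattern length `max_len`, the use of per-language training sequences, the greedy longest-match parse, and the definition of the compression ratio as input symbols per emitted code are all assumptions made for illustration. The weighted discriminant score (LZW-WDS) is not shown.

```python
from collections import Counter


def lzw_pattern_table(symbols):
    """Patterns collected by one plain LZW pass (shown only for comparison)."""
    table = {(s,) for s in symbols}          # start with the single symbols
    current = ()
    for s in symbols:
        candidate = current + (s,)
        if candidate in table:
            current = candidate              # keep extending the match
        else:
            table.add(candidate)             # one new pattern per mismatch
            current = (s,)
    return table


def universal_pattern_counts(symbols, max_len):
    """Count every contiguous pattern of length 1..max_len (assumed cap)."""
    counts = Counter()
    n = len(symbols)
    for length in range(1, max_len + 1):
        for start in range(n - length + 1):
            counts[tuple(symbols[start:start + length])] += 1
    return counts


def prune_table(counts, threshold):
    """Retain the patterns whose frequency is above the threshold."""
    return {pattern for pattern, c in counts.items() if c > threshold}


def code_count(symbols, table, max_len):
    """Greedy longest-match parse; unknown single symbols cost one code each."""
    i, codes, n = 0, 0, len(symbols)
    while i < n:
        step = 1
        for length in range(min(max_len, n - i), 1, -1):
            if tuple(symbols[i:i + length]) in table:
                step = length
                break
        i += step
        codes += 1
    return codes


def identify(symbols, tables, max_len):
    """Pick the language with the highest assumed compression ratio.

    Assumes `symbols` is non-empty.
    """
    scores = {
        lang: len(symbols) / code_count(symbols, table, max_len)
        for lang, table in tables.items()
    }
    return max(scores, key=scores.get), scores


# Toy symbol sequences standing in for phone-recognizer output (hypothetical).
training = {"A": list("abcabcabcabc"), "B": list("xyzxyzxyzxyz")}
MAX_LEN, THRESHOLD = 3, 2
tables = {
    lang: prune_table(universal_pattern_counts(seq, MAX_LEN), THRESHOLD)
    for lang, seq in training.items()
}
best, scores = identify(list("abcabc"), tables, MAX_LEN)
```

In this toy setup the test sequence parses into far fewer codes with the table of language A than with that of language B, so A scores higher. The comparison function also shows the gap the abstract describes: a single LZW pass over `abab` never records the pattern `aba`, while the exhaustive count does.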
Similar resources
Low Complexity LID using Prune
We present two discriminative language modelling techniques for Lempel-Ziv-Welch (LZW) based LID system. The previous approach to LID using LZW algorithm was to directly use the LZW pattern tables for language modelling. But, since the patterns in a language pattern table are shared by other language pattern tables, confusability prevailed in the LID task. For overcoming this, we present two pr...
Comparison of spectral methods for spoken language identification
Automatic spoken language identification is the task of identifying a language from the speech signal. Language identification systems can be divided into two categories: spectral-based methods and phonetic-based methods. In the former, short-time characteristics of the speech spectrum are extracted as a multi-dimensional vector. The statistical model of these features is then obtained for each language. The ...
Symbol Sequence Search from Telephone Conversation
We propose a method for searching for symbol sequences in conversations. Symbol sequences can include phone numbers, credit card numbers, and any kind of ticket (identification) numbers and are often communicated in call center conversations. Automatic extraction of these from speech is a key to many automatic speech recognition (ASR) applications such as question answering and summarization. C...
Implementing Shared Memory on Mesh-Connected Computers and on the Fat-Tree
We present deterministic upper and lower bounds on the slowdown required to simulate an (n;m)-PRAM on a variety of networks. The upper bounds are based on a novel scheme that exploits the splitting and combining of messages. This scheme can be implemented on an n-node d-dimensional mesh (for constant d) and on an n-leaf pruned butterfly and attains the smallest worst-case slowdown to date for su...
Compsci 650 Applied Information Theory Lecture 4
Since the probabilities of occurrence of the English symbols, P(a), ..., P(b), ..., P(" "), ..., P(;), are not uniform, we should be able to reduce the number of required bits. Based on the probability of each English symbol, we can compute the entropy H(E) ≈ 4.5 bits / char. If we use the Huffman coding taught in this lecture to encode the English keyboard, then we only need around 4.7 bits / char. Furthermo...